DT2002 <- data.table::fread("ad_viz_plotval_data.csv")
DT2022 <- data.table::fread("ad_viz_plotval_data (1).csv")
Assignment 1
Due Date
This assignment is due by midnight Pacific Time, September 27th, 2024.
Learning Goals
- Download, read, and get familiar with an external dataset.
- Step through the EDA “checklist” presented in class
- Practice making exploratory plots
Assignment Description
We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites. The primary question you will answer is whether daily concentrations of PM\(_{2.5}\) (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).
A primer on particulate matter air pollution can be found here.
Your assignment should be completed in Quarto or R Markdown.
Steps
Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.
Read Tables into R
Check the dimensions, headers, footers, variable names and variable types for 2002
::: {.cell}
```{.r .cell-code}
dim(DT2002)
```
::: {.cell-output .cell-output-stdout}
```
[1] 15976 22
```
:::
```{.r .cell-code}
head(DT2002)
```
::: {.cell-output .cell-output-stdout}
```
Date Source Site ID POC Daily Mean PM2.5 Concentration Units
<char> <char> <int> <int> <num> <char>
1: 01/05/2002 AQS 60010007 1 25.1 ug/m3 LC
2: 01/06/2002 AQS 60010007 1 31.6 ug/m3 LC
3: 01/08/2002 AQS 60010007 1 21.4 ug/m3 LC
4: 01/11/2002 AQS 60010007 1 25.9 ug/m3 LC
5: 01/14/2002 AQS 60010007 1 34.5 ug/m3 LC
6: 01/17/2002 AQS 60010007 1 41.0 ug/m3 LC
Daily AQI Value Local Site Name Daily Obs Count Percent Complete
<int> <char> <int> <num>
1: 81 Livermore 1 100
2: 93 Livermore 1 100
3: 74 Livermore 1 100
4: 82 Livermore 1 100
5: 98 Livermore 1 100
6: 115 Livermore 1 100
AQS Parameter Code AQS Parameter Description Method Code
<int> <char> <int>
1: 88101 PM2.5 - Local Conditions 120
2: 88101 PM2.5 - Local Conditions 120
3: 88101 PM2.5 - Local Conditions 120
4: 88101 PM2.5 - Local Conditions 120
5: 88101 PM2.5 - Local Conditions 120
6: 88101 PM2.5 - Local Conditions 120
Method Description CBSA Code
<char> <int>
1: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
2: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
3: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
4: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
5: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
6: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
CBSA Name State FIPS Code State
<char> <int> <char>
1: San Francisco-Oakland-Hayward, CA 6 California
2: San Francisco-Oakland-Hayward, CA 6 California
3: San Francisco-Oakland-Hayward, CA 6 California
4: San Francisco-Oakland-Hayward, CA 6 California
5: San Francisco-Oakland-Hayward, CA 6 California
6: San Francisco-Oakland-Hayward, CA 6 California
County FIPS Code County Site Latitude Site Longitude
<int> <char> <num> <num>
1: 1 Alameda 37.68753 -121.7842
2: 1 Alameda 37.68753 -121.7842
3: 1 Alameda 37.68753 -121.7842
4: 1 Alameda 37.68753 -121.7842
5: 1 Alameda 37.68753 -121.7842
6: 1 Alameda 37.68753 -121.7842
```
:::
```{.r .cell-code}
tail(DT2002)
```
::: {.cell-output .cell-output-stdout}
```
Date Source Site ID POC Daily Mean PM2.5 Concentration Units
<char> <char> <int> <int> <num> <char>
1: 12/10/2002 AQS 61131003 1 15 ug/m3 LC
2: 12/13/2002 AQS 61131003 1 15 ug/m3 LC
3: 12/22/2002 AQS 61131003 1 1 ug/m3 LC
4: 12/25/2002 AQS 61131003 1 23 ug/m3 LC
5: 12/28/2002 AQS 61131003 1 5 ug/m3 LC
6: 12/31/2002 AQS 61131003 1 6 ug/m3 LC
Daily AQI Value Local Site Name Daily Obs Count Percent Complete
<int> <char> <int> <num>
1: 62 Woodland-Gibson Road 1 100
2: 62 Woodland-Gibson Road 1 100
3: 6 Woodland-Gibson Road 1 100
4: 77 Woodland-Gibson Road 1 100
5: 28 Woodland-Gibson Road 1 100
6: 33 Woodland-Gibson Road 1 100
AQS Parameter Code AQS Parameter Description Method Code
<int> <char> <int>
1: 88101 PM2.5 - Local Conditions 117
2: 88101 PM2.5 - Local Conditions 117
3: 88101 PM2.5 - Local Conditions 117
4: 88101 PM2.5 - Local Conditions 117
5: 88101 PM2.5 - Local Conditions 117
6: 88101 PM2.5 - Local Conditions 117
Method Description CBSA Code
<char> <int>
1: R & P Model 2000 PM2.5 Sampler w/WINS 40900
2: R & P Model 2000 PM2.5 Sampler w/WINS 40900
3: R & P Model 2000 PM2.5 Sampler w/WINS 40900
4: R & P Model 2000 PM2.5 Sampler w/WINS 40900
5: R & P Model 2000 PM2.5 Sampler w/WINS 40900
6: R & P Model 2000 PM2.5 Sampler w/WINS 40900
CBSA Name State FIPS Code State
<char> <int> <char>
1: Sacramento--Roseville--Arden-Arcade, CA 6 California
2: Sacramento--Roseville--Arden-Arcade, CA 6 California
3: Sacramento--Roseville--Arden-Arcade, CA 6 California
4: Sacramento--Roseville--Arden-Arcade, CA 6 California
5: Sacramento--Roseville--Arden-Arcade, CA 6 California
6: Sacramento--Roseville--Arden-Arcade, CA 6 California
County FIPS Code County Site Latitude Site Longitude
<int> <char> <num> <num>
1: 113 Yolo 38.66121 -121.7327
2: 113 Yolo 38.66121 -121.7327
3: 113 Yolo 38.66121 -121.7327
4: 113 Yolo 38.66121 -121.7327
5: 113 Yolo 38.66121 -121.7327
6: 113 Yolo 38.66121 -121.7327
```
:::
```{.r .cell-code}
str(DT2002)
```
::: {.cell-output .cell-output-stdout}
```
Classes 'data.table' and 'data.frame': 15976 obs. of 22 variables:
$ Date : chr "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
$ Source : chr "AQS" "AQS" "AQS" "AQS" ...
$ Site ID : int 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
$ POC : int 1 1 1 1 1 1 1 1 1 1 ...
$ Daily Mean PM2.5 Concentration: num 25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
$ Units : chr "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
$ Daily AQI Value : int 81 93 74 82 98 115 89 62 69 107 ...
$ Local Site Name : chr "Livermore" "Livermore" "Livermore" "Livermore" ...
$ Daily Obs Count : int 1 1 1 1 1 1 1 1 1 1 ...
$ Percent Complete : num 100 100 100 100 100 100 100 100 100 100 ...
$ AQS Parameter Code : int 88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
$ AQS Parameter Description : chr "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
$ Method Code : int 120 120 120 120 120 120 120 120 120 120 ...
$ Method Description : chr "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
$ CBSA Code : int 41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
$ CBSA Name : chr "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
$ State FIPS Code : int 6 6 6 6 6 6 6 6 6 6 ...
$ State : chr "California" "California" "California" "California" ...
$ County FIPS Code : int 1 1 1 1 1 1 1 1 1 1 ...
$ County : chr "Alameda" "Alameda" "Alameda" "Alameda" ...
$ Site Latitude : num 37.7 37.7 37.7 37.7 37.7 ...
$ Site Longitude : num -122 -122 -122 -122 -122 ...
- attr(*, ".internal.selfref")=<externalptr>
```
:::
:::
There are 15976 rows and 22 columns for the 2002 data set. The header and footer are properly loaded with no apparent missing data.
Variable names are Date, Source, Site ID, POC, Daily Mean PM2.5 Concentration, Units, Daily AQI Value, Local Site Name, Daily Obs Count, Percent Complete, AQS Parameter Code, AQS Parameter Description, Method Code, Method Description, CBSA Code, CBSA Name, State FIPS Code, State, County FIPS Code, County, Site Latitude, and Site Longitude.
Categorical (character) variables: Date, Source, Units, Local Site Name, AQS Parameter Description, Method Description, CBSA Name, State, and County.
Numeric variables: Site ID, POC, Daily Mean PM2.5 Concentration, Daily AQI Value, Daily Obs Count, Percent Complete, AQS Parameter Code, Method Code, CBSA Code, State FIPS Code, County FIPS Code, Site Latitude, and Site Longitude.
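The split into character and numeric variables can also be produced programmatically rather than read off the `str()` output. The following is a minimal sketch on a made-up four-column table (the object `toy` and its values are assumptions for illustration, not the EPA data); the same `vapply()` call should work on the full table.

```r
# Toy stand-in for the EPA table; the values are made up for illustration.
toy <- data.frame(
  Date = c("01/05/2002", "01/06/2002"),
  `Site ID` = c(60010007L, 60010007L),
  `Daily Mean PM2.5 Concentration` = c(25.1, 31.6),
  County = c("Alameda", "Alameda"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Storage class of every column, as a named character vector.
col_types <- vapply(toy, function(col) class(col)[1], character(1))

# Split the column names by type.
char_vars <- names(col_types)[col_types == "character"]
num_vars <- names(col_types)[col_types %in% c("integer", "numeric")]

print(char_vars)
print(num_vars)
```

Note that Date is stored as a character string at this stage, so it lands in the character group until it is converted with `as.Date()`.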
Check the dimensions, headers, footers, variable names and variable types for 2022
::: {.cell}
```{.r .cell-code}
dim(DT2022)
```
::: {.cell-output .cell-output-stdout}
```
[1] 59756 22
```
:::
```{.r .cell-code}
head(DT2022)
```
::: {.cell-output .cell-output-stdout}
```
Date Source Site ID POC Daily Mean PM2.5 Concentration Units
<char> <char> <int> <int> <num> <char>
1: 01/01/2022 AQS 60010007 3 12.7 ug/m3 LC
2: 01/02/2022 AQS 60010007 3 13.9 ug/m3 LC
3: 01/03/2022 AQS 60010007 3 7.1 ug/m3 LC
4: 01/04/2022 AQS 60010007 3 3.7 ug/m3 LC
5: 01/05/2022 AQS 60010007 3 4.2 ug/m3 LC
6: 01/06/2022 AQS 60010007 3 3.8 ug/m3 LC
Daily AQI Value Local Site Name Daily Obs Count Percent Complete
<int> <char> <int> <num>
1: 58 Livermore 1 100
2: 60 Livermore 1 100
3: 39 Livermore 1 100
4: 21 Livermore 1 100
5: 23 Livermore 1 100
6: 21 Livermore 1 100
AQS Parameter Code AQS Parameter Description Method Code
<int> <char> <int>
1: 88101 PM2.5 - Local Conditions 170
2: 88101 PM2.5 - Local Conditions 170
3: 88101 PM2.5 - Local Conditions 170
4: 88101 PM2.5 - Local Conditions 170
5: 88101 PM2.5 - Local Conditions 170
6: 88101 PM2.5 - Local Conditions 170
Method Description CBSA Code
<char> <int>
1: Met One BAM-1020 Mass Monitor w/VSCC 41860
2: Met One BAM-1020 Mass Monitor w/VSCC 41860
3: Met One BAM-1020 Mass Monitor w/VSCC 41860
4: Met One BAM-1020 Mass Monitor w/VSCC 41860
5: Met One BAM-1020 Mass Monitor w/VSCC 41860
6: Met One BAM-1020 Mass Monitor w/VSCC 41860
CBSA Name State FIPS Code State
<char> <int> <char>
1: San Francisco-Oakland-Hayward, CA 6 California
2: San Francisco-Oakland-Hayward, CA 6 California
3: San Francisco-Oakland-Hayward, CA 6 California
4: San Francisco-Oakland-Hayward, CA 6 California
5: San Francisco-Oakland-Hayward, CA 6 California
6: San Francisco-Oakland-Hayward, CA 6 California
County FIPS Code County Site Latitude Site Longitude
<int> <char> <num> <num>
1: 1 Alameda 37.68753 -121.7842
2: 1 Alameda 37.68753 -121.7842
3: 1 Alameda 37.68753 -121.7842
4: 1 Alameda 37.68753 -121.7842
5: 1 Alameda 37.68753 -121.7842
6: 1 Alameda 37.68753 -121.7842
```
:::
```{.r .cell-code}
tail(DT2022)
```
::: {.cell-output .cell-output-stdout}
```
Date Source Site ID POC Daily Mean PM2.5 Concentration Units
<char> <char> <int> <int> <num> <char>
1: 12/01/2022 AQS 61131003 1 3.4 ug/m3 LC
2: 12/07/2022 AQS 61131003 1 3.8 ug/m3 LC
3: 12/13/2022 AQS 61131003 1 6.0 ug/m3 LC
4: 12/19/2022 AQS 61131003 1 34.8 ug/m3 LC
5: 12/25/2022 AQS 61131003 1 23.2 ug/m3 LC
6: 12/31/2022 AQS 61131003 1 1.0 ug/m3 LC
Daily AQI Value Local Site Name Daily Obs Count Percent Complete
<int> <char> <int> <num>
1: 19 Woodland-Gibson Road 1 100
2: 21 Woodland-Gibson Road 1 100
3: 33 Woodland-Gibson Road 1 100
4: 99 Woodland-Gibson Road 1 100
5: 77 Woodland-Gibson Road 1 100
6: 6 Woodland-Gibson Road 1 100
AQS Parameter Code AQS Parameter Description Method Code
<int> <char> <int>
1: 88101 PM2.5 - Local Conditions 145
2: 88101 PM2.5 - Local Conditions 145
3: 88101 PM2.5 - Local Conditions 145
4: 88101 PM2.5 - Local Conditions 145
5: 88101 PM2.5 - Local Conditions 145
6: 88101 PM2.5 - Local Conditions 145
Method Description CBSA Code
<char> <int>
1: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
2: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
3: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
4: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
5: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
6: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
CBSA Name State FIPS Code State
<char> <int> <char>
1: Sacramento--Roseville--Arden-Arcade, CA 6 California
2: Sacramento--Roseville--Arden-Arcade, CA 6 California
3: Sacramento--Roseville--Arden-Arcade, CA 6 California
4: Sacramento--Roseville--Arden-Arcade, CA 6 California
5: Sacramento--Roseville--Arden-Arcade, CA 6 California
6: Sacramento--Roseville--Arden-Arcade, CA 6 California
County FIPS Code County Site Latitude Site Longitude
<int> <char> <num> <num>
1: 113 Yolo 38.66121 -121.7327
2: 113 Yolo 38.66121 -121.7327
3: 113 Yolo 38.66121 -121.7327
4: 113 Yolo 38.66121 -121.7327
5: 113 Yolo 38.66121 -121.7327
6: 113 Yolo 38.66121 -121.7327
```
:::
```{.r .cell-code}
str(DT2022)
```
::: {.cell-output .cell-output-stdout}
```
Classes 'data.table' and 'data.frame': 59756 obs. of 22 variables:
$ Date : chr "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
$ Source : chr "AQS" "AQS" "AQS" "AQS" ...
$ Site ID : int 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
$ POC : int 3 3 3 3 3 3 3 3 3 3 ...
$ Daily Mean PM2.5 Concentration: num 12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
$ Units : chr "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
$ Daily AQI Value : int 58 60 39 21 23 21 13 38 59 55 ...
$ Local Site Name : chr "Livermore" "Livermore" "Livermore" "Livermore" ...
$ Daily Obs Count : int 1 1 1 1 1 1 1 1 1 1 ...
$ Percent Complete : num 100 100 100 100 100 100 100 100 100 100 ...
$ AQS Parameter Code : int 88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
$ AQS Parameter Description : chr "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
$ Method Code : int 170 170 170 170 170 170 170 170 170 170 ...
$ Method Description : chr "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
$ CBSA Code : int 41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
$ CBSA Name : chr "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
$ State FIPS Code : int 6 6 6 6 6 6 6 6 6 6 ...
$ State : chr "California" "California" "California" "California" ...
$ County FIPS Code : int 1 1 1 1 1 1 1 1 1 1 ...
$ County : chr "Alameda" "Alameda" "Alameda" "Alameda" ...
$ Site Latitude : num 37.7 37.7 37.7 37.7 37.7 ...
$ Site Longitude : num -122 -122 -122 -122 -122 ...
- attr(*, ".internal.selfref")=<externalptr>
```
:::
:::
There are 59756 rows and 22 columns for the 2022 data set. The header and footer are properly loaded with no apparent missing data. All variable names and types are the same as in the 2002 data set.
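The claim that names and types match across the two years can be checked directly instead of by eye. This is a small sketch under the assumption of two toy tables (`toy_2002` and `toy_2022` are hypothetical stand-ins, not the downloaded files):

```r
# Two toy tables standing in for the 2002 and 2022 downloads (made-up values).
toy_2002 <- data.frame(Date = "01/05/2002", PM2.5 = 25.1, County = "Alameda",
                       stringsAsFactors = FALSE)
toy_2022 <- data.frame(Date = "01/01/2022", PM2.5 = 12.7, County = "Alameda",
                       stringsAsFactors = FALSE)

# Helper: first class of each column, named by column.
col_types <- function(df) vapply(df, function(col) class(col)[1], character(1))

# TRUE only if the column names match in the same order.
same_names <- identical(names(toy_2002), names(toy_2022))

# TRUE only if every column also has the same type.
same_types <- identical(col_types(toy_2002), col_types(toy_2022))

print(c(same_names = same_names, same_types = same_types))
```

If both checks are TRUE, stacking the tables with `rbind()` should not coerce any column to a different type.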
Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
::: {.cell}
```{.r .cell-code}
library(dplyr)
```
::: {.cell-output .cell-output-stderr}
```
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
    filter, lag
The following objects are masked from 'package:base':
    intersect, setdiff, setequal, union
```
:::
```{.r .cell-code}
# Combine the tables
DT <- rbind(DT2002, DT2022)

# Create a new column for Year
DT$Date <- as.Date(DT$Date, format = "%m/%d/%Y")
DT$Year <- format(DT$Date, "%Y")

# Change the names of key variables
DT <- DT |>
  rename("PM2.5" = "Daily Mean PM2.5 Concentration",
         "lat" = "Site Latitude",
         "lon" = "Site Longitude")
DT$Year <- as.numeric(as.character(DT$Year))

# Double check variables
head(DT)
```
::: {.cell-output .cell-output-stdout}
```
Date Source Site ID POC PM2.5 Units Daily AQI Value
<Date> <char> <int> <int> <num> <char> <int>
1: 2002-01-05 AQS 60010007 1 25.1 ug/m3 LC 81
2: 2002-01-06 AQS 60010007 1 31.6 ug/m3 LC 93
3: 2002-01-08 AQS 60010007 1 21.4 ug/m3 LC 74
4: 2002-01-11 AQS 60010007 1 25.9 ug/m3 LC 82
5: 2002-01-14 AQS 60010007 1 34.5 ug/m3 LC 98
6: 2002-01-17 AQS 60010007 1 41.0 ug/m3 LC 115
Local Site Name Daily Obs Count Percent Complete AQS Parameter Code
<char> <int> <num> <int>
1: Livermore 1 100 88101
2: Livermore 1 100 88101
3: Livermore 1 100 88101
4: Livermore 1 100 88101
5: Livermore 1 100 88101
6: Livermore 1 100 88101
AQS Parameter Description Method Code Method Description
<char> <int> <char>
1: PM2.5 - Local Conditions 120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
2: PM2.5 - Local Conditions 120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
3: PM2.5 - Local Conditions 120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
4: PM2.5 - Local Conditions 120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
5: PM2.5 - Local Conditions 120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
6: PM2.5 - Local Conditions 120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
CBSA Code CBSA Name State FIPS Code State
<int> <char> <int> <char>
1: 41860 San Francisco-Oakland-Hayward, CA 6 California
2: 41860 San Francisco-Oakland-Hayward, CA 6 California
3: 41860 San Francisco-Oakland-Hayward, CA 6 California
4: 41860 San Francisco-Oakland-Hayward, CA 6 California
5: 41860 San Francisco-Oakland-Hayward, CA 6 California
6: 41860 San Francisco-Oakland-Hayward, CA 6 California
County FIPS Code County lat lon Year
<int> <char> <num> <num> <num>
1: 1 Alameda 37.68753 -121.7842 2002
2: 1 Alameda 37.68753 -121.7842 2002
3: 1 Alameda 37.68753 -121.7842 2002
4: 1 Alameda 37.68753 -121.7842 2002
5: 1 Alameda 37.68753 -121.7842 2002
6: 1 Alameda 37.68753 -121.7842 2002
```
:::
```{.r .cell-code}
tail(DT)
```
::: {.cell-output .cell-output-stdout}
```
Date Source Site ID POC PM2.5 Units Daily AQI Value
<Date> <char> <int> <int> <num> <char> <int>
1: 2022-12-01 AQS 61131003 1 3.4 ug/m3 LC 19
2: 2022-12-07 AQS 61131003 1 3.8 ug/m3 LC 21
3: 2022-12-13 AQS 61131003 1 6.0 ug/m3 LC 33
4: 2022-12-19 AQS 61131003 1 34.8 ug/m3 LC 99
5: 2022-12-25 AQS 61131003 1 23.2 ug/m3 LC 77
6: 2022-12-31 AQS 61131003 1 1.0 ug/m3 LC 6
Local Site Name Daily Obs Count Percent Complete AQS Parameter Code
<char> <int> <num> <int>
1: Woodland-Gibson Road 1 100 88101
2: Woodland-Gibson Road 1 100 88101
3: Woodland-Gibson Road 1 100 88101
4: Woodland-Gibson Road 1 100 88101
5: Woodland-Gibson Road 1 100 88101
6: Woodland-Gibson Road 1 100 88101
AQS Parameter Description Method Code
<char> <int>
1: PM2.5 - Local Conditions 145
2: PM2.5 - Local Conditions 145
3: PM2.5 - Local Conditions 145
4: PM2.5 - Local Conditions 145
5: PM2.5 - Local Conditions 145
6: PM2.5 - Local Conditions 145
Method Description CBSA Code
<char> <int>
1: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
2: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
3: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
4: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
5: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
6: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
CBSA Name State FIPS Code State
<char> <int> <char>
1: Sacramento--Roseville--Arden-Arcade, CA 6 California
2: Sacramento--Roseville--Arden-Arcade, CA 6 California
3: Sacramento--Roseville--Arden-Arcade, CA 6 California
4: Sacramento--Roseville--Arden-Arcade, CA 6 California
5: Sacramento--Roseville--Arden-Arcade, CA 6 California
6: Sacramento--Roseville--Arden-Arcade, CA 6 California
County FIPS Code County lat lon Year
<int> <char> <num> <num> <num>
1: 113 Yolo 38.66121 -121.7327 2022
2: 113 Yolo 38.66121 -121.7327 2022
3: 113 Yolo 38.66121 -121.7327 2022
4: 113 Yolo 38.66121 -121.7327 2022
5: 113 Yolo 38.66121 -121.7327 2022
6: 113 Yolo 38.66121 -121.7327 2022
```
:::
:::
Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
::: {.cell}
```{.r .cell-code}
library("leaflet")

# Map each year to a color on the viridis scale
color_palette <- colorNumeric(palette = "viridis", domain = DT$Year)

leaflet(DT) |>
  addProviderTiles('OpenStreetMap') |>
  addCircles(lat = ~lat, lng = ~lon, opacity = 1, fillOpacity = 1, radius = 100,
             color = ~color_palette(Year), fillColor = ~color_palette(Year))
```
:::
Monitoring sites appear to be concentrated where population density is higher. For instance, cities and the coastline have many more monitoring sites than the mountain ranges. Specifically, there is a high density of sites around the Bay Area, Los Angeles, and San Diego. This seems logical because we would like to know air pollution levels where people live and where the air is more likely to be polluted.
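A count of distinct sites per year can back up what the map shows, since overlapping circles hide whether a location reported in one year or both. The sketch below uses a made-up table; `site_id` is a hypothetical stand-in for the `Site ID` column, and the numbers are not from the EPA data.

```r
# Toy observations: one row per site-day (made-up values).
toy <- data.frame(
  site_id = c(1L, 1L, 2L, 1L, 2L, 3L, 3L),
  Year = c(2002, 2002, 2002, 2022, 2022, 2022, 2022)
)

# Keep one row per site-year combination, then count sites in each year.
site_years <- unique(toy[, c("site_id", "Year")])
sites_per_year <- table(site_years$Year)
print(sites_per_year)

# Sites that reported in both years.
both <- intersect(site_years$site_id[site_years$Year == 2002],
                  site_years$site_id[site_years$Year == 2022])
print(both)
```

On the real table, a much larger count in one year would suggest that differences between years partly reflect a change in the monitoring network rather than in air quality alone.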
Check for any missing or implausible values of PM\(_{2.5}\) in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
::: {.cell}
```{.r .cell-code}
sum(is.na(DT$PM2.5))
```
::: {.cell-output .cell-output-stdout}
```
[1] 0
```
:::
```{.r .cell-code}
# There are no missing values of PM2.5.
summary(DT$PM2.5)
```
::: {.cell-output .cell-output-stdout}
```
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.70 4.50 7.60 10.05 12.20 302.50
```
:::
```{.r .cell-code}
# Set a maximum value of 500 ug/m^3 (as given by the 2012 EPA) and a minimum value of 0:
max_PM <- 500
min_PM <- 0
impossible <- DT$PM2.5[DT$PM2.5 < min_PM | DT$PM2.5 > max_PM]
sum_impossible <- length(impossible)
print(sum_impossible)
```
::: {.cell-output .cell-output-stdout}
```
[1] 215
```
:::
```{.r .cell-code}
# The maximum (302.5) is below the 500 cap, so every impossible value is negative.
# These negative values are all set to 0.
DT_new <- DT
DT_new$PM2.5 <- ifelse(DT_new$PM2.5 < 0, 0, DT_new$PM2.5)

# Find the percentage of impossible data
prop <- sum_impossible/length(DT$PM2.5)*100
print(prop)
```
::: {.cell-output .cell-output-stdout}
```
[1] 0.2838958
```
:::
```{.r .cell-code}
# Only 0.284% of the data is impossible

# Temporal summary: number of negative values per date
temporal <- DT[DT$PM2.5 < 0, .(Count = .N), by = Date]
temporal <- temporal[order(-Count)]
print(temporal)
```
::: {.cell-output .cell-output-stdout}
```
Date Count
<Date> <int>
1: 2022-12-31 11
2: 2022-07-06 8
3: 2022-12-30 8
4: 2022-09-19 7
5: 2022-12-11 6
---
114: 2022-02-08 1
115: 2022-02-10 1
116: 2022-01-15 1
117: 2022-02-02 1
118: 2022-04-23 1
```
:::
```{.r .cell-code}
# All negative values are from 2022
```
:::
Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
state
::: {.cell}
```{.r .cell-code}
DT_new$year <- as.factor(DT_new$Year)
library(ggplot2)

# Create a histogram
ggplot(data = DT_new) +
  geom_histogram(aes(x = PM2.5, fill = year)) +
  labs(title = "PM2.5 by Year in California", x = "PM2.5")
```
::: {.cell-output .cell-output-stderr}
```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```
:::
```{.r .cell-code}
# Create a box plot
ggplot(data = DT_new) +
  geom_boxplot(aes(x = year, y = PM2.5, fill = year)) +
  labs(title = "PM2.5 by Year in California", x = "Year", y = "PM2.5")
```
```{.r .cell-code}
# Statistical summary of state data
State <- DT_new[, .(Mean = mean(PM2.5),
                    Min = min(PM2.5),
                    Max = max(PM2.5),
                    IQR = IQR(PM2.5)),
                by = Year]
print(State)
```
::: {.cell-output .cell-output-stdout}
```
Year Mean Min Max IQR
<num> <num> <num> <num> <num>
1: 2002 16.115943 0 104.3 13.5
2: 2022 8.431138 0 302.5 6.6
```
:::
:::
Overall, the average PM2.5 for the state decreased, from 16.1 ug/m^3 in 2002 to 8.4 ug/m^3 in 2022. The interquartile range also decreased from 13.5 to 6.6, meaning the middle half of the 2022 values is less spread out than in 2002. There are some counties with much higher, outlying values in 2022 (a maximum of 302.5 vs 104.3 in 2002), which slightly increase the 2022 mean. Therefore, some counties may have had worse pollution episodes in 2022 than in 2002, but the state as a whole shows decreased air pollution levels.
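Because a few extreme readings can pull the mean upward, the median is a useful companion statistic for the year-to-year comparison. The sketch below illustrates the effect on made-up numbers (the `toy` table and its single 300 ug/m^3 reading are assumptions for illustration only):

```r
# Toy data: 2022 has one extreme reading (made-up values).
toy <- data.frame(
  Year = c(2002, 2002, 2002, 2022, 2022, 2022, 2022),
  PM2.5 = c(10, 16, 22, 5, 7, 9, 300)
)

# Mean and median of PM2.5 within each year.
mean_by_year <- tapply(toy$PM2.5, toy$Year, mean)
median_by_year <- tapply(toy$PM2.5, toy$Year, median)

summary_tab <- data.frame(
  Year = names(mean_by_year),
  Mean = as.numeric(mean_by_year),
  Median = as.numeric(median_by_year)
)
print(summary_tab)
```

In this toy case the 2022 mean exceeds the 2002 mean only because of the single outlier, while the median still shows a decrease. Adding a median column to the state summary would make the real comparison robust to outliers in the same way.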
county
::: {.cell}
```{.r .cell-code}
# Create a bar graph
ggplot(data = DT_new, aes(x = County, y = PM2.5, fill = year)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "PM2.5 Trends by County", x = "County", y = "PM2.5") +
  coord_flip()
```
```{.r .cell-code}
# Create a heat map
ggplot(DT_new, aes(x = year, y = County, fill = PM2.5)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "PM2.5 2002 and 2022 by County", x = "Year", y = "County")
```
```{.r .cell-code}
# Statistical summary of county data
County_stats <- DT_new |>
  group_by(year, County) |>
  summarize(Mean = mean(PM2.5),
            Max = max(PM2.5),
            Min = min(PM2.5),
            .groups = "drop") |>
  arrange(County)
print(County_stats)
```
::: {.cell-output .cell-output-stdout}
```
# A tibble: 98 × 5
year County Mean Max Min
<fct> <chr> <dbl> <dbl> <dbl>
1 2002 Alameda 14.3 61.6 1.9
2 2022 Alameda 8.20 35.5 0
3 2002 Butte 14.8 88 1
4 2022 Butte 6.19 42.8 0
5 2002 Calaveras 9.9 40 2
6 2022 Calaveras 6.04 25.9 0
7 2002 Colusa 11.7 57 1
8 2022 Colusa 7.61 37 0.6
9 2002 Contra Costa 15.1 76.7 2
10 2022 Contra Costa 8.25 37.3 0.9
# ℹ 88 more rows
```
:::
:::
Many counties with PM2.5 measurements in both 2002 and 2022 decreased their average PM2.5 values, as seen in the table and charts. Counties with very high PM2.5 values around 300 ug/m^3 in 2022, identified as outliers at the state level, have relatively low mean values, meaning that a few isolated readings in a few counties account for the extreme values seen at the state level. Thus, overall, most California counties decreased their average PM2.5, leading to lower air pollution levels in California as a whole.
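The statement that many counties decreased can be turned into a count by computing each county's change in mean PM2.5 between the two years. This is a minimal sketch on made-up data (the counties "A" and "B" and their values are hypothetical); counties measured in only one year would come out as NA and are excluded from the count.

```r
# Toy county-level observations (made-up values).
toy <- data.frame(
  County = c("A", "A", "A", "A", "B", "B", "B"),
  Year = c(2002, 2002, 2022, 2022, 2002, 2022, 2022),
  PM2.5 = c(14, 16, 8, 10, 9, 11, 13)
)

# County-by-year matrix of mean PM2.5; rows are counties, columns are years.
county_means <- tapply(toy$PM2.5, list(toy$County, toy$Year), mean)

# Change from 2002 to 2022 (negative means a decrease).
change <- county_means[, "2022"] - county_means[, "2002"]

# How many counties with data in both years decreased?
n_both <- sum(!is.na(change))
n_decreased <- sum(change < 0, na.rm = TRUE)
print(change)
print(c(n_both = n_both, n_decreased = n_decreased))
```

Reporting `n_decreased` out of `n_both` for the real data would replace "many counties" with an exact figure.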
site in Los Angeles
::: {.cell}
```{.r .cell-code}
# Create a new table for LA data
LA <- DT_new |> filter(County == "Los Angeles")
LA_2002 <- LA |> filter(year == "2002")
LA_2022 <- LA |> filter(year == "2022")

# Create a line plot for LA PM2.5 in 2002 and 2022
library(gridExtra)
```
::: {.cell-output .cell-output-stderr}
```
Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':
    combine
```
:::
```{.r .cell-code}
plot_2002 <- ggplot(LA_2002, aes(x = Date, y = PM2.5)) +
  geom_line(color = "blue") +
  geom_point(color = "blue") +
  labs(title = "Change in PM2.5 in Los Angeles in 2002", x = "Date in 2002", y = "PM2.5")
plot_2022 <- ggplot(LA_2022, aes(x = Date, y = PM2.5)) +
  geom_line(color = "red") +
  geom_point(color = "red") +
  labs(title = "Change in PM2.5 in Los Angeles in 2022", x = "Date in 2022", y = "PM2.5")
grid.arrange(plot_2002, plot_2022, ncol = 2)
```
```{.r .cell-code}
# Create a box plot for 2002 and 2022
ggplot(LA, aes(x = year, y = PM2.5)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "PM2.5 Levels in Los Angeles in 2002 vs 2022", x = "Year", y = "PM2.5")
```
```{.r .cell-code}
# Statistical summary of LA data
LA_stats <- LA[, .(Mean = mean(PM2.5),
                   Min = min(PM2.5),
                   Max = max(PM2.5),
                   IQR = IQR(PM2.5)),
               by = year]
print(LA_stats)
```
::: {.cell-output .cell-output-stdout}
```
year Mean Min Max IQR
<fctr> <num> <num> <num> <num>
1: 2002 19.65604 0.6 72.4 14.4
2: 2022 10.97258 0.0 56.0 6.3
```
:::
:::
Los Angeles County had substantially lower PM2.5 levels in 2022 than in 2002, with a lower mean (11 ug/m^3 in 2022 vs 20 ug/m^3 in 2002) and less variation (an IQR of 6.3 in 2022 vs 14.4 in 2002). Additionally, unlike some other counties, even the high outliers decreased, with the maximum PM2.5 value falling from 72.4 in 2002 to 56.0 in 2022. Therefore, Los Angeles County shows decreased air pollution values in 2022 compared to 2002, which is impressive for such a large and populated region.
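The summary above pools every monitor in Los Angeles County. A stricter site-level comparison would follow a single monitor that reported in both years. The sketch below shows one way to do that on made-up data; `site_id` is a hypothetical stand-in for the `Site ID` column, and the site numbers and values are invented for illustration.

```r
# Toy Los Angeles observations from two hypothetical sites (made-up values).
toy_la <- data.frame(
  site_id = c(101L, 101L, 101L, 101L, 102L, 102L),
  Year = c(2002, 2002, 2022, 2022, 2022, 2022),
  PM2.5 = c(20, 24, 10, 12, 9, 11)
)

# Sites that reported in both years.
sites_2002 <- unique(toy_la$site_id[toy_la$Year == 2002])
sites_2022 <- unique(toy_la$site_id[toy_la$Year == 2022])
common <- intersect(sites_2002, sites_2022)

# Restrict to the first such site and compare its yearly means.
one_site <- toy_la[toy_la$site_id == common[1], ]
site_means <- tapply(one_site$PM2.5, one_site$Year, mean)
print(site_means)
```

Comparing the same monitor across years removes the possibility that a change in the county-wide figures comes from monitors being added or retired.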
This homework has been adapted from the case study in Roger Peng’s Exploratory Data Analysis with R